Hypothesis testing
With the t-test, we perform what is called a two-means hypothesis test. We define what is called the null hypothesis, which is what we assume by default or the baseline: that there is no difference between the two means:
Ho: μ1 = μ2
μ1 is the true mean of one group
μ2 is the true mean of the other group
Thus, as in law, we fail to reject this null hypothesis unless we have enough evidence to suggest that we should reject it, similar to the idea that we by default assume that individuals are not guilty until proven otherwise. If we reject the null hypothesis, we then accept an alternative hypothesis. This can vary for different statistical tests, but in the case of the t-test we evaluate whether the two means are not equal:
Ha: μ1 ≠ μ2
Remember that we can’t know the true mean for our populations of interest, instead we approximate or estimate our means based on the sample that we have in our data.
When we use a statistical test to evaluate hypotheses like these, we use what we call the p-value. The p-value is a measure of the strength of the evidence against our null hypothesis. It is common practice to consider a p-value < 0.05 strong enough evidence against the null hypothesis to reject it. Alternatively, if the p-value > 0.05, then there is not enough evidence to reject the null hypothesis. We will use the p-values from our statistical tests to decide how to interpret the results of our tests.
We are interested in comparing the means of female rural and urban BMI measurements for both years.
There are two possible classes of statistical tests that we could run to compare the means of these two groups:
- Parametric
- Nonparametric
Parametric tests are based on assumptions about the distribution of the data, while nonparametric tests do not rely on this assumption. They are called “parametric” because aspects of the distribution of the data, like the mean, are called parameters when we describe a population. In parametric tests we estimate the parameters of the true population of interest using a sample of that population. These estimates are called statistics. See here for more information about the difference between these two classes of tests.
Parametric two sample mean tests
Often when comparing two groups we might perform a two sample t-test to determine if the means of the groups are different. The two sample t-test, however, relies on several assumptions:
- The data for both groups is normally distributed
- The variance of both groups is similar
- The number of observations is similar for both groups - thus they are balanced
- The observations are independent (meaning that observations do not influence each other)
If these assumptions are violated, this doesn’t necessarily mean we can’t perform a t-test. It just means we may need to consider the following options:
- Transformation of the data to make it more normally distributed
- Welch’s t-test, also called the unequal variance t-test, which modifies the way we perform the t-test to account for a difference in the variance of the two groups
- Permutation/resampling methods to deal with violations of normality or imbalance.
Alternatively, we can use a nonparametric test, which is model-free and does not rely on assumptions about the parameters of the data. These tests are often a good option when multiple assumptions are violated or when sample sizes are small. We will explore these options.
Our data has a balance of observations for both groups - in fact they are equal, thus that assumption is not violated. If it were violated, we would want to consider using permutation methods which are also a good option for violations of normality. To learn more about these methods see here.
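As a sketch of the permutation idea (entirely made-up data, not the BMI measurements): repeatedly shuffle the group labels, recompute the difference in means each time, and compare the observed difference against this null distribution.

```r
set.seed(123)
x <- rnorm(30, mean = 0)   # hypothetical group 1
y <- rnorm(30, mean = 2)   # hypothetical group 2 with a shifted mean
obs_diff <- mean(x) - mean(y)

# Shuffle the group labels many times to build a null distribution
pooled <- c(x, y)
perm_diffs <- replicate(2000, {
  idx <- sample(length(pooled), length(x))
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided permutation p-value: how often a random relabeling produces
# a difference at least as extreme as the observed one
p_value <- mean(abs(perm_diffs) >= abs(obs_diff))
p_value
```

Because no distributional assumptions are made about the data, this approach also tolerates non-normality and imbalance.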
If we needed to check if our samples were imbalanced, we could use the count() function of dplyr:
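The code chunk that produced the table below is not shown; it was presumably `count(BMI_long, Sex, Year, Region)`. Here is the same pattern on a hypothetical miniature tibble so the snippet is self-contained:

```r
library(dplyr)

# Hypothetical stand-in for BMI_long, with one row per combination;
# with the real data the call would be: count(BMI_long, Sex, Year, Region)
toy <- tibble(
  Sex    = rep(c("Men", "Women"), each = 6),
  Year   = rep(rep(c("1985", "2017"), each = 3), times = 2),
  Region = rep(c("National", "Rural", "Urban"), times = 4)
)
count(toy, Sex, Year, Region)   # one row per Sex/Year/Region combination
```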
# A tibble: 12 x 4
Sex Year Region n
<chr> <chr> <chr> <int>
1 Men 1985 National 200
2 Men 1985 Rural 200
3 Men 1985 Urban 200
4 Men 2017 National 200
5 Men 2017 Rural 200
6 Men 2017 Urban 200
7 Women 1985 National 200
8 Women 1985 Rural 200
9 Women 1985 Urban 200
10 Women 2017 National 200
11 Women 2017 Rural 200
12 Women 2017 Urban 200
We can see that the number of observations for each possible group of interest is the same.
The t-test is also fairly robust to non-normality if the sample is relatively large, due to what is called the central limit theorem, which states that as sample sizes get larger, the distribution of the sample mean approaches a normal distribution.
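A quick simulation sketch of this (all numbers made up): even when individual values come from a strongly skewed exponential distribution, means of samples of size 200 cluster tightly and symmetrically around the true mean.

```r
set.seed(42)
# 1000 sample means, each computed from 200 draws of a skewed
# exponential distribution with true mean 1
sample_means <- replicate(1000, mean(rexp(200, rate = 1)))
mean(sample_means)   # close to the true mean of 1
sd(sample_means)     # close to the theoretical 1 / sqrt(200)
```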
We have an n of 200 for each group, which should be sufficient, but let’s investigate the nonparametric tests further.
Often we would check if the variance of the rural and urban data is equal using the var.test() function. However, this is an F-test and assumes that the data is normally distributed. Instead we will use the mood.test() function, which performs Mood’s two-sample test for a difference in scale parameters and does not assume that the data is normally distributed. We will also introduce the pull() function of the dplyr package.
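The code chunk for the first comparison (women, 2017, rural versus urban) is not shown; reconstructed from the data: line of the output below, the call was presumably the one in the comment here, demonstrated on made-up vectors so the snippet runs on its own:

```r
# Presumed call with the case study data (reconstructed, not verified):
# mood.test(dplyr::pull(filter(BMI_long, Sex == "Women", Year == "2017",
#                              Region == "Rural"), BMI),
#           dplyr::pull(filter(BMI_long, Sex == "Women", Year == "2017",
#                              Region == "Urban"), BMI))

# Self-contained illustration: two made-up samples with different spread
set.seed(1)
mood.test(rnorm(50, sd = 1), rnorm(50, sd = 2))
```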
Mood two-sample test of scale
data: dplyr::pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and dplyr::pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
Z = 2.9189, p-value = 0.003513
alternative hypothesis: two.sided
# p value <.05, conclude that variance is not equal
# reject the null: no difference in the spread of the distributions
mood.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long,
Sex == "Women",
Year == "1985",
Region == "Urban"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI)
Z = 3.1305, p-value = 0.001745
alternative hypothesis: two.sided
# p value <.05, conclude that variance is not equal
# reject the null: no difference in the variance of the distributions
mood.test(pull(filter(BMI_long,
Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI)
Z = -0.24228, p-value = 0.8086
alternative hypothesis: two.sided
# p value >.05, conclude that variance is equal
# fail to reject the null: no difference in the variance of the distributions
mood.test(pull(filter(BMI_long,
Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI))
Mood two-sample test of scale
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
Z = 1.5317, p-value = 0.1256
alternative hypothesis: two.sided
Our p-value is less than .05 for both of the rural versus urban comparisons, thus for those we reject the null hypothesis that there is no difference in scale and conclude that the variance is not equal, so our data also violates this assumption. For the 1985 versus 2017 comparisons within each region, the p-values are greater than .05, so we fail to reject the null and treat those variances as equal.
We will perform a version of the t-test that accounts for the fact that our variance is not equal.
Another important consideration is that the data is what we call paired, meaning that the measurements from the rural and urban areas are not independent. That is because we have a rural and an urban mean for nearly every country, thus these values may be more similar to one another when they come from the same country. The same is true for the male and female measurements from the same country, and for the 1985 and 2017 values from the same country. However, we are assuming that measurements from different countries are independent, so that assumption is not violated, making it reasonable to perform a paired t-test (note that with paired = TRUE, t.test() operates on the within-pair differences, so the var.equal argument has no effect on a paired test).
When we perform a paired t-test our hypothesis is slightly different from that of the typical Student’s t-test. In this case we are testing the differences between the pairs of observations and how close these differences are to zero. Our null hypothesis is that the mean of the differences is equal to zero:
Ho: μd = 0
μd is the true mean differences
between paired observations of the two groups
The alternative hypothesis is that the mean of the differences is not equal to zero
Ha: μd ≠ 0
μd is the true mean differences
between paired observations of the two groups
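This can be illustrated on simulated paired data (names and numbers here are invented): a paired t-test is numerically identical to a one-sample t-test of the within-pair differences against zero.

```r
set.seed(7)
country_effect <- rnorm(20)                         # shared per-"country" component
rural <- 25 + country_effect + rnorm(20, sd = 0.5)  # paired measurement 1
urban <- 26 + country_effect + rnorm(20, sd = 0.5)  # paired measurement 2

p_paired <- t.test(rural, urban, paired = TRUE)$p.value
p_diffs  <- t.test(rural - urban, mu = 0)$p.value
c(p_paired, p_diffs)   # the two p-values are identical
```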
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
t = -10.356, df = 194, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0573625 -0.7190478
sample estimates:
mean of the differences
-0.8882051
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI)
t = -14.095, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.1870263 -0.8956268
sample estimates:
mean of the differences
-1.041327
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI)
t = -22.119, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.591762 -2.167422
sample estimates:
mean of the differences
-2.379592
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
t = -24.378, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.383938 -2.027118
sample estimates:
mean of the differences
-2.205528
Question opportunity: Looking at the t value, was global BMI lower in Rural or Urban areas in 1985?
Now we will try to transform our data to make it more normally distributed. One way to do this is to take the logarithm of the data values. Then we will see how this influences the results. Again we will focus on the data for women.
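The chunk that creates the transformed data and the Shapiro-Wilk tables below is not shown; here is a sketch under the assumption that the transformed data frame and column are named BMI_long_log and log_BMI, as in the later code (a hypothetical miniature BMI_long is defined so the snippet runs on its own):

```r
library(dplyr)

# Hypothetical stand-in for the real BMI_long (women only, made-up values)
BMI_long <- tibble(
  Sex    = "Women",
  Year   = rep(c("1985", "2017"), each = 6),
  Region = rep(rep(c("Rural", "Urban"), each = 3), times = 2),
  BMI    = c(21.2, 22.5, 20.9, 23.1, 24.8, 22.0,
             25.1, 26.0, 24.3, 27.3, 26.5, 28.1)
)

# Add a log-transformed column; the names match the later code chunks
BMI_long_log <- mutate(BMI_long, log_BMI = log(BMI))

# Shapiro-Wilk normality p-values per Year/Region group; running the same
# summary on BMI instead of log_BMI gives the table for the original values
BMI_long_log %>%
  filter(Sex == "Women") %>%
  group_by(Year, Region) %>%
  summarise(shapiro_test = shapiro.test(log_BMI)$p.value)
```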

# A tibble: 6 x 3
# Groups: Year [2]
Year Region shapiro_test
<chr> <chr> <dbl>
1 1985 National 0.00315
2 1985 Rural 0.0105
3 1985 Urban 0.000334
4 2017 National 0.293
5 2017 Rural 0.0784
6 2017 Urban 0.00416
# A tibble: 6 x 3
# Groups: Year [2]
Year Region shapiro_test
<chr> <chr> <dbl>
1 1985 National 0.0000478
2 1985 Rural 0.000679
3 1985 Urban 0.000000332
4 2017 National 0.00363
5 2017 Rural 0.0108
6 2017 Urban 0.0000130
The data appears to be more similar to the normal distribution, although not quite there. Again, our sample size of 200 is quite large and the t-test is generally quite robust to violations of normality with large n, thus the modified t-test to account for unequal variance might be a good option using the log-transformed data, as it is at least closer to normally distributed.
Let’s see the results of the t-test with the transformed data:
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Urban"), log_BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Rural"), log_BMI) and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Urban"), log_BMI)
t = -10.058, df = 194, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.04242774 -0.02851589
sample estimates:
mean of the differences
-0.03547182
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Urban"), log_BMI),
var.equal = FALSE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Rural"), log_BMI) and pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Urban"), log_BMI)
t = -13.962, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.05214677 -0.03923811
sample estimates:
mean of the differences
-0.04569244
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Rural"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Rural"), log_BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Rural"), log_BMI) and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Rural"), log_BMI)
t = -22.369, df = 195, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.10617051 -0.08896626
sample estimates:
mean of the differences
-0.09756839
# means are different - p value <.05 reject the null: no difference in the means
t.test(pull(filter(BMI_long_log, Sex == "Women",
Year == "1985",
Region == "Urban"), log_BMI),
pull(filter(BMI_long_log, Sex == "Women",
Year == "2017",
Region == "Urban"), log_BMI),
var.equal = TRUE, paired = TRUE)
Paired t-test
data: pull(filter(BMI_long_log, Sex == "Women", Year == "1985", Region == "Urban"), log_BMI) and pull(filter(BMI_long_log, Sex == "Women", Year == "2017", Region == "Urban"), log_BMI)
t = -23.977, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09377834 -0.07952498
sample estimates:
mean of the differences
-0.08665166
We can see that our results are quite similar to those from the original data; the t values changed only slightly. In other cases we may see a much more dramatic influence of transforming our data.
Now, let’s take a look at nonparametric tests, which are also a great option when the assumptions of the t-test are violated.
Nonparametric two sample tests
There are multiple nonparametric options to consider when the assumptions of the t-test are violated. The Wilcoxon signed rank test (for paired data; for independent samples the alternative is the Wilcoxon rank sum test, also called the Mann-Whitney U test) and the two-sample Kolmogorov-Smirnov (KS) test do not assume normality. These tests should be considered when the data of either group does not appear to be normally distributed, particularly when the number of samples is low.
Importantly, the KS test does not assume normality or equal variance, while the Wilcoxon signed rank test does assume equal variance. In our case, because the variance is not equal between some of our groups of interest, the KS test is more appropriate for those comparisons. The KS test evaluates whether the overall distributions of the two groups are identical; it does not test one particular aspect of the distribution like the mean, so there are no confidence intervals in the output of this test. Note also that the KS test treats the two samples as independent: ks.test() does not have a paired version, unlike wilcox.test(). Here is how you would perform these tests.
ks.test(pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI))
Two-sample Kolmogorov-Smirnov test
data: pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
D = 0.20006, p-value = 0.0007385
alternative hypothesis: two-sided
ks.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI))
Two-sample Kolmogorov-Smirnov test
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI)
D = 0.19914, p-value = 0.0007779
alternative hypothesis: two-sided
What about the difference in female BMI from 1985 to 2017 for both regions? Recall that the variance was equal for these comparisons.
wilcox.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Rural"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Rural"), BMI),
paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Rural"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Rural"), BMI)
V = 273, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
wilcox.test(pull(filter(BMI_long, Sex == "Women",
Year == "1985",
Region == "Urban"), BMI),
pull(filter(BMI_long, Sex == "Women",
Year == "2017",
Region == "Urban"), BMI),
paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: pull(filter(BMI_long, Sex == "Women", Year == "1985", Region == "Urban"), BMI) and pull(filter(BMI_long, Sex == "Women", Year == "2017", Region == "Urban"), BMI)
V = 189, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
There is a significant difference across time for both regions, as we saw with the t-test. There is also a significant difference by region for each year. However, the p-values from the KS tests are a bit larger than those we saw with the t-test.
p-values
The p in p-value stands for probability: the probability that we would obtain a test statistic (for example, the t in our Student’s t-tests based on the means of our comparison groups) at least as extreme as the one observed just by random chance alone, if the null hypothesis were true. Therefore a p-value of 0.02 means that there is a 2 percent chance that our data would look the way it does purely because of random chance and not because there is really a difference in the means of the groups of interest.
So then, if alpha is the p-value threshold below which we declare significance (and thus the probability of a false positive result when the null hypothesis is true), the probability of not making an incorrect conclusion is 1 - alpha:
P(Not making an error) = 1 - α
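The output below was presumably produced by a calculation like this, with the conventional alpha of 0.05:

```r
alpha <- 0.05   # significance threshold
1 - alpha       # probability of not making a Type 1 error on one test
```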
[1] 0.95
OK, so if we use an alpha of .05, we are accepting that 95% of the time we will not make a Type 1 error and 5% of the time we will, just by random chance.
Taking this one step further:
P(Making an error) = 1 - P(Not making an error)
P(Making an error) = 1 - (1 - α)
Here we can see that this checks out:
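The check below was presumably computed as:

```r
alpha <- 0.05
1 - (1 - alpha)   # probability of making a Type 1 error on one test
```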
[1] 0.05
So what happens if we perform multiple tests?
The probability of not making a Type 1 error stays the same for each individual test. Assuming the tests are independent, we multiply these probabilities together to determine the overall probability of making no error across multiple tests. See here about why we multiply probabilities together.
P(Not making an error in m tests) = (1 - α)^m
P(Making at least 1 error in m tests) = 1 - (1 - α)^m
Let’s consider if we performed 10 different statistical tests and if we as usual considered the significance threshold alpha of .05:
So the probability of getting 1 significant result with 10 tests is:
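The value below follows from the formula above with m = 10:

```r
alpha <- 0.05
m <- 10                 # number of tests
1 - (1 - alpha)^m       # probability of at least one false positive
```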
[1] 0.4012631
So there is roughly a 40% chance that there will be at least one significant finding simply due to random chance alone.
What about 100 tests?
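The same formula with m = 100:

```r
alpha <- 0.05
m <- 100                # number of tests
1 - (1 - alpha)^m       # probability of at least one false positive
```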
[1] 0.9940795
Yikes!! That is almost a 100% chance that there will be a significant finding simply due to chance alone!
Much of this explanation is described in this lecture
One way to correct for this multiple testing issue is the Bonferroni method.
In this method we divide our significance threshold (generally 0.05) by the number of tests - here, the four t-tests we performed.
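The threshold below was presumably computed by dividing alpha by the number of t-tests above:

```r
alpha <- 0.05
m <- 4          # the four paired t-tests performed above
alpha / m       # Bonferroni-corrected significance threshold
```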
[1] 0.0125
Our new significance threshold is now 0.0125. Thus our p-values should be less than this value for us to reject the null that there is no difference in means. In all cases, our p-values were less than 0.0125. So we see a significant difference in the means of the groups after multiple testing correction for our different tests.
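Base R can also apply the Bonferroni correction by scaling the p-values instead of the threshold, via p.adjust(); a small example on hypothetical p-values:

```r
p_values <- c(0.001, 0.010, 0.030, 0.200)   # hypothetical p-values from 4 tests
p.adjust(p_values, method = "bonferroni")   # each multiplied by 4, capped at 1
```

The adjusted values can then be compared directly against the original 0.05 threshold.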
Again, it would be reasonable to use the t-test here because it is robust to deviations from normality when samples are relatively large. We can see that we obtained the same conclusions regardless of the test that we used. However, if sample sizes are small (generally speaking, n < 15 for each group), then these nonparametric options are especially useful to know.
The D values in the output of our KS tests show the magnitude of the maximum distance between the empirical cumulative distributions of the groups tested. You may notice that the D value is larger for the tests of BMI across time than across region; in these tests the p-value was also smaller.